Chapter 2 Bias assessment

2.1 Bias due to omitted confounders

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_2 + \dots + \epsilon_i; \;\; for \;\; i=1, \dots, n\]

where the errors \(\epsilon_i \sim N(0, \sigma^2)\) with independent and identically distributed (i.i.d.)

Let’s assume the following association is true (i.e., gold standard) without any selection bias, measurement bias, and other unmeasured confoundings.

N <- 100000
C <- rnorm(N)
X <- .5 * C + rnorm(N)
Y <- .3 * C + .4 * X + rnorm(N)

2.1.1 Gold standard

With the correct model specification (i.e., \(C\) as a confounder), we get an unbiased estimate of \(X\) on \(Y\).

# Gold standard
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0002749  0.0031611  -0.087    0.931    
## X            0.3946589  0.0031560 125.050   <2e-16 ***
## C            0.3000315  0.0035321  84.944   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.99923)
## 
##     Null deviance: 140182  on 99999  degrees of freedom
## Residual deviance:  99920  on 99997  degrees of freedom
## AIC: 283716
## 
## Number of Fisher Scoring iterations: 2

2.1.2 Misspecified model: a confounder, \(C\), was omitted from the model

By omitting \(C\), the estimate of \(X\) was biased either “away from” or “towards to” the null

# C was omitted
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0009031  0.0032731  -0.276    0.783    
## X            0.5139919  0.0029263 175.647   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.071321)
## 
##     Null deviance: 140182  on 99999  degrees of freedom
## Residual deviance: 107130  on 99998  degrees of freedom
## AIC: 290681
## 
## Number of Fisher Scoring iterations: 2

2.1.3 Bias “away from” or “towards to” the null?

N <- 100000
C <- rnorm(N)
X <- -.5 * C + rnorm(N)
Y <- -.3 * C + .4 * X + rnorm(N)
# C was omitted
glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.006081   0.003159  -1.925   0.0542 .  
## X            0.395894   0.003151 125.635   <2e-16 ***
## C           -0.303350   0.003536 -85.791   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9977328)
## 
##     Null deviance: 140645  on 99999  degrees of freedom
## Residual deviance:  99770  on 99997  degrees of freedom
## AIC: 283566
## 
## Number of Fisher Scoring iterations: 2
glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005787   0.003273  -1.768    0.077 .  
## X            0.516742   0.002921 176.927   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.071159)
## 
##     Null deviance: 140645  on 99999  degrees of freedom
## Residual deviance: 107114  on 99998  degrees of freedom
## AIC: 290666
## 
## Number of Fisher Scoring iterations: 2

2.1.4 A \(C\) is not a confounder on \(X\) and \(Y\)

N <- 100000
C <- rnorm(N)
X <- rnorm(N)
Y <- .4 * X + rnorm(N)

2.1.5 Correct model specification: Without \(C\)

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.002195   0.003155  -0.696    0.486    
## X            0.399377   0.003137 127.296   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.99524)
## 
##     Null deviance: 115649  on 99999  degrees of freedom
## Residual deviance:  99522  on 99998  degrees of freedom
## AIC: 283315
## 
## Number of Fisher Scoring iterations: 2

2.1.6 Misspecified model with \(C\)

glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.002178   0.003155  -0.690     0.49    
## X            0.399385   0.003137 127.298   <2e-16 ***
## C           -0.003138   0.003156  -0.994     0.32    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9952401)
## 
##     Null deviance: 115649  on 99999  degrees of freedom
## Residual deviance:  99521  on 99997  degrees of freedom
## AIC: 283316
## 
## Number of Fisher Scoring iterations: 2

2.1.7 A \(C\) is a colloder on \(X\) and \(Y\)

N <- 100000
X <- rnorm(N)
Y <- .7 * X + rnorm(N)
C <- 1.2 * X + .6 * Y + rnorm(N)

2.1.8 Correct model specification: Without \(C\)

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.002388   0.003162   0.755     0.45    
## X           0.699838   0.003164 221.203   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9996525)
## 
##     Null deviance: 148877  on 99999  degrees of freedom
## Residual deviance:  99963  on 99998  degrees of freedom
## AIC: 283757
## 
## Number of Fisher Scoring iterations: 2

2.1.9 Misspecified model with \(C\)

This is one of examples of selection bias. For example, let’s say, \(X\) is Education, \(Y\) is income, and \(C\) is social welfare program. People at lower education (i.e., high risk group in terms of exposure) and lower income (i.e., higher risk group in terms of outcome) are more likely to register social welfare program. If survey was conducted based on the registered social welfare program, the “estimated” association from this “disproportionally selected” respondents are likely biased.

glm.unbiased <- glm(Y~X + C, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + C, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.003167   0.002712   1.168  0.24296    
## X           -0.015152   0.004648  -3.260  0.00111 ** 
## C            0.441405   0.002329 189.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.735545)
## 
##     Null deviance: 148877  on 99999  degrees of freedom
## Residual deviance:  73552  on 99997  degrees of freedom
## AIC: 253078
## 
## Number of Fisher Scoring iterations: 2

2.2 Overadjustment bias

Please note that this is not a comprehensive example; only reflect one aspect of potential overadjustement bias.

Let’s assume a model with \(M\) as a mediator.

N <- 100000
X <- rnorm(N)
M <- .5 * X + rnorm(N)
Y <- .3 * X + .4 * M + rnorm(N)

2.3 Total effect

glm.unbiased <- glm(Y~X, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003265   0.003408  -0.958    0.338    
## X            0.498483   0.003400 146.634   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.161286)
## 
##     Null deviance: 141096  on 99999  degrees of freedom
## Residual deviance: 116126  on 99998  degrees of freedom
## AIC: 298745
## 
## Number of Fisher Scoring iterations: 2

2.4 Overadjustment

glm.unbiased <- glm(Y~X + M, family="gaussian")
summary(glm.unbiased)
## 
## Call:
## glm(formula = Y ~ X + M, family = "gaussian")
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003290   0.003166  -1.039    0.299    
## X            0.300176   0.003528  85.084   <2e-16 ***
## M            0.398780   0.003163 126.061   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.002054)
## 
##     Null deviance: 141096  on 99999  degrees of freedom
## Residual deviance: 100202  on 99997  degrees of freedom
## AIC: 283998
## 
## Number of Fisher Scoring iterations: 2

2.5 Logistic models

2.5.1 Sex as a Confounder, \(C\)

MYY <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 5
                  )

MYN <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 8
                  )

MNY <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 45
                  )

MNN <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 72
                  )


FYY <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 25
                  )

FYN <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 10
                  )

FNY <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 25
                  )

FNN <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 10
                  )

Ex_confounder <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)

Convert Freq table to raw data

library(tidyr)
raw_confounder <- Ex_confounder %>% 
  uncount(freq)
glm.unbiased <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder)
summary(glm.unbiased)
## 
## Call:
## glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.1582     0.1627  -0.972   0.3309  
## SmokingYes    0.6690     0.3397   1.970   0.0489 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 277.26  on 199  degrees of freedom
## Residual deviance: 273.28  on 198  degrees of freedom
## AIC: 277.28
## 
## Number of Fisher Scoring iterations: 4
  • Full model:
glm_logit <- glm(Cancer ~ Smoking + Sex , family=binomial(link = "logit"), data=raw_confounder)
glm_logit
## 
## Call:  glm(formula = Cancer ~ Smoking + Sex, family = binomial(link = "logit"), 
##     data = raw_confounder)
## 
## Coefficients:
## (Intercept)   SmokingYes      SexMale  
##   9.163e-01    4.266e-15   -1.386e+00  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       277.3 
## Residual Deviance: 257   AIC: 263
  • Stratified models
## For males
raw_confounder_M <- raw_confounder[ which(raw_confounder$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_M)
glm_logit_m
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder_M)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##  -4.700e-01    6.672e-16  
## 
## Degrees of Freedom: 129 Total (i.e. Null);  128 Residual
## Null Deviance:       173.2 
## Residual Deviance: 173.2     AIC: 177.2
# For females
raw_confounder_F <- raw_confounder[ which(raw_confounder$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_confounder_F)
glm_logit_f
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_confounder_F)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##   9.163e-01    9.400e-16  
## 
## Degrees of Freedom: 69 Total (i.e. Null);  68 Residual
## Null Deviance:       83.76 
## Residual Deviance: 83.76     AIC: 87.76

2.5.2 Sex as a Moderator, \(M\)

MYY <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 5
                  )

MYN <- data.frame(Sex = "Male",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 4
                  )

MNY <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 45
                  )

MNN <- data.frame(Sex = "Male",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 68
                  )


FYY <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 1,
                  freq = 25
                  )

FYN <- data.frame(Sex = "Female",
                  Smoking = "Yes",
                  Cancer = 0,
                  freq = 14
                  )

FNY <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 1,
                  freq = 25
                  )

FNN <- data.frame(Sex = "Female",
                  Smoking = "No",
                  Cancer = 0,
                  freq = 14
                  )

Ex_moderator <- rbind(MYY, MYN, MNY, MNN, FYY, FYN, FNY, FNN)

Convert Freq table to raw data

library(tidyr)
raw_moderator <- Ex_moderator %>% 
  uncount(freq)
  • Full model:
glm_logit <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator)
glm_logit
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##     -0.1582       0.6690  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
## Null Deviance:       277.3 
## Residual Deviance: 273.3     AIC: 277.3
  • Stratified models
## For males
raw_moderator_M <- raw_moderator[ which(raw_moderator$Sex=='Male'), ]
glm_logit_m <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_M)
glm_logit_m
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator_M)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##     -0.4128       0.6360  
## 
## Degrees of Freedom: 121 Total (i.e. Null);  120 Residual
## Null Deviance:       165.1 
## Residual Deviance: 164.3     AIC: 168.3
# For females
raw_moderator_F <- raw_moderator[ which(raw_moderator$Sex=='Female'), ]
glm_logit_f <- glm(Cancer ~ Smoking , family=binomial(link = "logit"), data=raw_moderator_F)
glm_logit_f
## 
## Call:  glm(formula = Cancer ~ Smoking, family = binomial(link = "logit"), 
##     data = raw_moderator_F)
## 
## Coefficients:
## (Intercept)   SmokingYes  
##   5.798e-01   -2.621e-16  
## 
## Degrees of Freedom: 77 Total (i.e. Null);  76 Residual
## Null Deviance:       101.8 
## Residual Deviance: 101.8     AIC: 105.8